Welcome to Apache Spark with R

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.

In this notebook we will introduce basic concepts of SparkSQL with R, as described in the SparkR documentation, applied to the example people dataset. We will do two things: read the data into a SparkSQL DataFrame, and take a quick look at the schema of what we have read.


Creating a SparkSQL context and loading data

In [1]:
# Load SparkR and start a Spark session with a custom application name
library(SparkR)
sc <- sparkR.session(sparkConfig = list(spark.app.name = "R Spark Test"))


Attaching package: 'SparkR'

The following objects are masked from 'package:stats':

    cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from 'package:base':

    as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
    rank, rbind, sample, startsWith, subset, summary, transform, union

Launching java with spark-submit command /usr/local/spark/bin/spark-submit   sparkr-shell /tmp/Rtmp3mWazA/backend_port9e113cc0dc 
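
Note that attaching SparkR masks several functions from stats and base, as the messages above show. The masked originals remain reachable under their namespaces; a minimal sketch, not part of the original run:

# After library(SparkR), filter() refers to SparkR::filter();
# the masked stats version can still be called explicitly
stats::filter(c(1, 2, 3, 4, 5), rep(1 / 3, 3))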

In [3]:
# Read the JSON file into a SparkSQL DataFrame; the schema is inferred from the data
people <- read.df("/opt/datasets/people.json", "json")

In [4]:
printSchema(people)


root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
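
The same schema information can also be pulled into plain R structures; dtypes() and columns() are SparkR functions, and this sketch was not part of the original run:

# Column name/type pairs as an R list
dtypes(people)
# Just the column names
columns(people)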

In [5]:
head(people)


  age    name
1  NA Michael
2  30    Andy
3  19  Justin

In [6]:
# Keep only the rows where age is greater than 19
head(filter(people, people$age > 19))


  age name
1  30 Andy
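
As a quick sketch of the SQL side of SparkSQL (not part of the original run), the same filter can be written as a SQL query over a temporary view; the view name "people" is just an illustrative choice:

# Register the DataFrame as a temporary view so it can be queried with SQL
createOrReplaceTempView(people, "people")

# Equivalent of filter(people, people$age > 19)
adults <- sql("SELECT name, age FROM people WHERE age > 19")
head(adults)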